Introduction:

Since the beginning of 2020, Covid-19 has affected people’s lives globally. Every country has enacted policies to deal with the disease. Rising unemployment, falling GDP, higher inflation and similar signs show that the economy is under tremendous risk. In this project, I ask whether differences in people’s income influence the number of deaths from Covid-19 in the US. I also treat the GDP level during the Covid-19 period as a potential confounding variable in the analysis.

Method:

Variable Description:

  • State: The 50 US states plus the District of Columbia (51 in total)

  • State_full_name: Full name of each state

  • Lon: Longitude

  • Lat: Latitude

  • Income: Median household income in US dollars

  • Urban_rural_code: A classification scheme that distinguishes counties by population

  • Covid_death: Deaths caused by Covid-19

  • All-Causes death: Deaths from all causes during the analysis period

  • total_covid_death_instate: Total number of deaths caused by Covid-19 in each state

  • total_all_death_instate: Total number of deaths from all causes in each state

  • death_mean_urban: Average number of deaths caused by Covid-19 in each type of county

For the first dataset, I use the median income for each state in the US, provided by the United States Census Bureau at ‘https://www.census.gov/search-results.html?q=Median+income+&page=1&stateGeo=none&searchtype=web&cssp=SERP&_charset_=UTF-8’. For the second dataset, I use the counts of Covid-19 deaths and all-cause deaths in each state and county in the US, provided by the CDC at ‘https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy’. For the third dataset, I use the GDP level of each state in the US from “https://worldpopulationreview.com/state-rankings/gdp-by-state”.

I first merged the two datasets that contain the main variables, Income and deaths caused by Covid-19, by the variable ‘State’ to obtain a full dataset for further analysis. I then removed the commas from numeric values (for example, changing 14,500 to 14500) so that R reads them as numbers. Next, I renamed variables whose names contain spaces, for example changing “urban rural code” to “urban_rural_code”. Before producing any statistical results, the most important step is to check for missing values. For any observation with a missing death count, I replaced the missing value with 0. To summarize the key outcome by state, I created new variables that hold the total number of deaths in each state.

To analyze the confounding variable, I combined the existing data ‘covid1’ with the GDP data to form a new dataset called ‘gdp_incme_covid’. With this combined data, I measure the association between GDP level and Covid-19 deaths and the association between GDP level and Income. Since the GDP data are already clean, the combined dataset needs no further cleaning.

I then created a table showing the details of each key variable. The table contains six variables by state: the full name of the state, number of counties, GDP, Income, COVID-19 deaths and all-cause deaths. For the data visualization, I plotted 4 graphs to show the associations between the key variables. For example, I drew a US map to show the density of COVID-19 deaths in each state and a scatter plot to show the linear association between Income and the number of Covid-19 deaths.
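The cleaning steps described above can be sketched in a few lines. The analysis in this report was run in R; the sketch below uses Python's standard library purely for illustration and is not the code used for the project. The two small tables, the state codes `AA` and `BB`, and all numbers in them are invented, and the helper names `to_int` and `load` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the income and Covid-19 tables; every value is invented.
INCOME_CSV = """State,Income
AA,"50,000"
BB,"62,500"
"""

COVID_CSV = """State,urban rural code,Covid_death
AA,Small metro,"1,200"
AA,Noncore,
BB,Large central metro,"14,500"
"""

def to_int(text):
    """Remove thousands separators; treat a missing value as 0."""
    text = (text or "").replace(",", "").strip()
    return int(text) if text else 0

def load(csv_text):
    """Read CSV text into dict rows, replacing spaces in column names with underscores."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({key.replace(" ", "_"): value for key, value in row.items()})
    return rows

# Look-up table of median income keyed by state.
income_by_state = {r["State"]: to_int(r["Income"]) for r in load(INCOME_CSV)}

# Merge the two tables on 'State', keeping only states present in both.
merged = []
for r in load(COVID_CSV):
    if r["State"] not in income_by_state:
        continue
    merged.append({
        "State": r["State"],
        "urban_rural_code": r["urban_rural_code"],
        "Covid_death": to_int(r["Covid_death"]),
        "Income": income_by_state[r["State"]],
    })

# Add the per-state total of Covid-19 deaths to every row.
totals = defaultdict(int)
for r in merged:
    totals[r["State"]] += r["Covid_death"]
for r in merged:
    r["total_covid_death_instate"] = totals[r["State"]]

for r in merged:
    print(r)
```

Each step (merge, comma removal, renaming, zero-filling, per-state totals) maps onto one short piece of the sketch, which makes it easy to check a single step on a handful of rows before running it on the full data.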

Preliminary Results:

We checked the dimensions of our data and found 3023 observations with 17 variables each. I then summarized the key variables: Income, GDP, Covid-19 deaths and all-cause deaths. Mississippi has the lowest median income, $45,081, and the District of Columbia has the highest, $86,420. Vermont has the fewest COVID-19 deaths, 283, and California has the most, 73,920; the mean number of COVID-19 deaths is 20,504. For GDP, Vermont also has the lowest value, $33,278 million, and California has the highest, $3,120,386 million.

This graph shows the distribution of Covid-19 deaths on a US map. The more deaths a state has, the closer its color is to blue. California, Florida, New York and Texas have many more COVID-19 deaths than the other states. In detail, during the period from 01/01/2020 to 10/20/2021, California had 73,920 Covid-19 deaths, Texas had 72,436, New York had 57,508 and Florida had 56,496.

For the second plot, we used a bar plot to show the distribution of Income by state. The range of Income across states is relatively large, $41,339. Mississippi has the lowest median income, $45,081, and the District of Columbia has the highest, $86,420.
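The extremes and the range quoted above can be recomputed with a short check. This is a minimal sketch in Python rather than the R code used for the project; it includes only the figures quoted in this report, so the other states are missing, and the helper name `extremes` is hypothetical.

```python
# Only figures quoted in this report; all other states are omitted.
reported_income = {"Mississippi": 45081, "District of Columbia": 86420}
reported_covid_deaths = {
    "Vermont": 283,
    "California": 73920,
    "Texas": 72436,
    "New York": 57508,
    "Florida": 56496,
}

def extremes(values):
    """Return the key with the lowest value, the key with the highest, and the range."""
    lowest = min(values, key=values.get)
    highest = max(values, key=values.get)
    return lowest, highest, values[highest] - values[lowest]

income_low, income_high, income_range = extremes(reported_income)
death_low, death_high, death_range = extremes(reported_covid_deaths)

print(income_low, income_high, income_range)
print(death_low, death_high, death_range)
```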

This graph shows the association between the urban-rural classification and COVID-19 deaths. There is no clear linear association, so we cannot say that counties with larger populations have more COVID-19 deaths. For example, counties classified as ‘Medium metro’ have more Covid-19 deaths than counties classified as ‘Large fringe metro’, and counties classified as ‘Small metro’ have more Covid-19 deaths than counties classified as ‘Micropolitan’.

This graph shows the relationship between our two main variables, Median Income and deaths due to Covid-19. We used a scatter plot with a smooth line to detect the association. However, the pattern is not clear: the smoothed curve is bell-shaped, resembling a normal distribution, because the 4 states with especially high Covid-19 death counts strongly influence the overall association. As the next step, we consider GDP as a confounding variable.

Checking whether GDP is a confounding variable that affects the association between Income and deaths caused by Covid-19

We used a bar chart to show the GDP level of each state. The three states with the highest GDP are California, Texas and New York. California has the highest GDP, $3,120,386 million, followed by Texas with $1,772,132 million and New York with $1,705,127 million.

This scatter plot with a smooth line shows the association between Median Income and GDP level. The association looks positive and linear, but the slope is very small.

This scatter plot with a smooth line shows the association between GDP and Covid-19 deaths. There is a strong positive linear association between the two. Because GDP is associated with both Median Income and Covid-19 deaths, we treat GDP as a confounding variable in our main analysis. This is an important finding, since further analysis should control for GDP.
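One possible way to carry out the check described above, and to control for GDP afterwards, is to compare the plain correlation between Income and Covid-19 deaths with their partial correlation given GDP. The report does not state which method will be used, so partial correlation is an assumption here, not the project's method. The sketch is in Python rather than R, the six data points are invented for illustration, and the function names `pearson` and `partial_corr` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Invented toy values, one entry per hypothetical state.
income = [45, 52, 58, 61, 70, 86]        # median income, thousands of dollars
gdp = [100, 300, 250, 900, 1700, 150]    # GDP, billions of dollars
deaths = [3, 9, 6, 25, 50, 2]            # Covid-19 deaths, thousands

print("income vs GDP:          ", round(pearson(income, gdp), 3))
print("GDP vs deaths:          ", round(pearson(gdp, deaths), 3))
print("income vs deaths:       ", round(pearson(income, deaths), 3))
print("income vs deaths | GDP: ", round(partial_corr(income, deaths, gdp), 3))
```

If the income-deaths correlation changes noticeably once GDP is held fixed, that is consistent with GDP acting as a confounder; a regression of deaths on both income and GDP would be an alternative way to make the same adjustment.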

Conclusion:

We collected information on median income and COVID-19 deaths for all 50 states and the District of Columbia. Four states, CA, TX, NY and FL, have far more COVID-19 deaths than the others. The median incomes in CA, TX and NY are over $60,000, which is relatively high, but there is no clear pattern in the linear association between income and COVID-19 deaths. GDP is considered a confounding variable in our analysis and needs to be controlled for. In further analysis, I would introduce more variables, such as race and gender, to examine whether they confound the association between income and COVID-19 deaths.